
Allow Bench To Configure Data Processing Pipeline Per Scenario #60

Merged
merged 5 commits into main from bench-tokenize on Aug 2, 2024

Conversation

fabianlim
Contributor

@fabianlim fabianlim commented Jul 31, 2024

This PR allows bench to configure a `data_processing` stanza per scenario. Currently we support two styles:

  1. a functional recipe style, where the formatting functions must be implemented in Python.
  2. a Jinja style, where the formatting is given as a template.

Loss Masking

  • Loss masking is performed automatically if `--response_template` is passed in the arguments.
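The PR does not show the masking code itself, but the idea behind response-template loss masking can be sketched as follows. This is a hypothetical illustration, not the repo's actual implementation: every label token up to and including the response template is replaced with the ignore index, so the loss is computed only on the response tokens.

```python
# Hedged sketch: mask all tokens before (and including) the response
# template so only the response contributes to the loss.
IGNORE_INDEX = -100  # the index ignored by cross-entropy loss in PyTorch

def mask_before_response(input_ids, response_template_ids):
    """Return labels where everything up to and including the first
    occurrence of the response template is masked with IGNORE_INDEX."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            # mask the prompt tokens and the template tokens themselves
            for i in range(start + n):
                labels[i] = IGNORE_INDEX
            break
    return labels
```

In practice this is typically done on tokenized ids (e.g. the token ids of `"### Response:"`); the function names here are illustrative only.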

Older Style

In this style, the recipe is specified by the `formatting` flag. This requires updating Python code so it understands what needs to be done for that particular recipe.

  • For the recipe to understand the fields of the dataset, you also need to specify `input_field` and so on.
  • This style is not ideal, as everything is opaque: you have to open the data-processing functions to see exactly what processing is actually happening.
```yaml
data_processing:
  dataset_name: yahma/alpaca-cleaned
  formatting: "instruct"
  tokenize: True
  input_field: input
```
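To make the "opaque" complaint concrete, a functional-recipe dispatcher of this kind might look like the sketch below. All names (`format_instruct`, `FORMATTING_RECIPES`, `apply_recipe`) are hypothetical, invented for illustration; only the `formatting` and `input_field` config keys come from the PR.

```python
# Hedged sketch of the "functional recipe" style: the `formatting` config
# key selects a Python function, and `input_field` names the dataset
# column it should read. None of these names are the repo's actual code.
def format_instruct(example, input_field="input"):
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example.get(input_field):
        prompt += f"### Input:\n{example[input_field]}\n"
    return prompt + f"### Response:\n{example['output']}"

FORMATTING_RECIPES = {"instruct": format_instruct}

def apply_recipe(example, formatting, input_field="input"):
    # the processing behind each recipe name lives in Python code,
    # which is why this style is opaque from the config alone
    return FORMATTING_RECIPES[formatting](example, input_field=input_field)
```

Note how nothing in the YAML reveals what `"instruct"` does; you must read the Python function to find out.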

Chat Templates

In this style we rely on HF's integration of chat templating.

  • This is more flexible and is the preferred approach.
```yaml
data_processing:
  dataset_name: yahma/alpaca-cleaned
  chat_template: |
    {%- for message in messages %}
        {% if message['input'] != '' %}
    Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

        {% else %}
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

        {% endif %}
    ### Instruction:
    {{ message['instruction'] }}

        {% if message['input'] != '' %}
    ### Input:
    {{ message['input'] }}

        {% endif %}
    ### Response:
    {{ message['output'] + eos_token }}
    {% endfor %}
  tokenize: True
```
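HF tokenizers render such a `chat_template` internally (via `tokenizer.apply_chat_template`); the mechanics can be shown with Jinja2 directly. The sketch below uses a simplified template in the same spirit as the one above, assuming the same `messages` / `eos_token` variables; it is an illustration, not the template the config actually ships.

```python
# Hedged sketch: rendering an Alpaca-style chat template with Jinja2.
# The template here is simplified relative to the one in the config.
from jinja2 import Template

CHAT_TEMPLATE = (
    "{%- for message in messages %}"
    "### Instruction:\n{{ message['instruction'] }}\n"
    "{% if message['input'] != '' %}### Input:\n{{ message['input'] }}\n{% endif %}"
    "### Response:\n{{ message['output'] + eos_token }}"
    "{% endfor %}"
)

def render(messages, eos_token="</s>"):
    # the same variables HF exposes to chat templates: messages, eos_token
    return Template(CHAT_TEMPLATE).render(messages=messages, eos_token=eos_token)
```

Because the formatting now lives in the config as a template string, you can see the exact prompt layout without opening any Python code, which is why this style is preferred.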

@fabianlim fabianlim requested a review from achew010 July 31, 2024 13:39
@fabianlim fabianlim marked this pull request as draft July 31, 2024 13:39
@fabianlim fabianlim force-pushed the bench-tokenize branch 3 times, most recently from 7d200de to b3f43b7 Compare August 1, 2024 08:23
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@fabianlim fabianlim marked this pull request as ready for review August 1, 2024 16:41
@fabianlim fabianlim merged commit 0e51785 into main Aug 2, 2024
6 checks passed
@fabianlim fabianlim deleted the bench-tokenize branch August 2, 2024 02:35